Project GitHub repository: https://github.com/dongguaguaguagua/Homework_2021 (continuously updated).

1 Part I: Differential expression

In this HW, we will evaluate the differentially expressed genes and pathways between breast cancer and normal breast tissues. Our collaborator generated RNA-seq on ten pairs of breast tumors and matched adjacent normal tissues, located at /n/stat115/2021/HW2/raw_data1. The experiments were run in two batches, each batch with 5 pairs of samples, and the batch information is provided in batch.csv. We have run Salmon for gene quantification, which is provided on Cannon at /n/stat115/2021/HW2/raw_data1/Salmon_results. Remember to convert Ensembl IDs to gene symbols using the “org.Hs.eg.db” package in R.


1.1 Problem I.1

Please install the following R/Bioconductor packages. Then try “library(package)” to make sure the package works.

Note: the sva package, with its ComBat function, is used for batch effect removal;

DESeq2 package is used for differential gene expression analysis;

tximport package is used for importing transcript-level abundance into gene-level matrices for downstream analysis;

ggplot2 package is used for general plotting in R;

pheatmap package is used for heatmap plotting;

dplyr package is used for data frame manipulations in R;

fgsea package is used for gene-set enrichment analysis.
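Before loading, the packages can be installed in one step. A minimal sketch, assuming a working internet connection and that the Bioconductor packages are installed via BiocManager:

```r
# Install BiocManager itself if absent, then the required packages.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
# Bioconductor packages
BiocManager::install(c("sva", "DESeq2", "tximport", "fgsea", "org.Hs.eg.db"))
# CRAN packages
install.packages(c("ggplot2", "pheatmap", "dplyr"))
```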

library(ggplot2)
library(sva)
library(DESeq2)
library(tximport)
library(dplyr)
library(fgsea)
library(pheatmap)
library(ComplexHeatmap)
library(clusterProfiler)
library(factoextra)
library(FactoMineR)
library(ape)
library(ggtree)
library(Cairo)
library(org.Hs.eg.db)

1.2 Problem I.2

For RNA-seq analysis, visualization using principal component analysis (PCA) or hierarchical clustering is an essential step of exploratory analysis to identify potential batch effects. Please import transcript-level TPM values from the Salmon output and convert them to gene-level TPM values. Perform PCA of the samples using log2-transformed TPM values. Indicate the tissue (tumor or normal, shown by shape) and the batch (1 or 2, shown by color) of each sample. Next, try hierarchical clustering for these samples.

Do the results suggest the presence of a batch effect?

For this question, you will load the Salmon output at /n/stat115/2021/HW2/raw_data1/Salmon_results. You also need to read in the batch information provided in /n/stat115/2021/HW2/raw_data1/batch.csv. Remember to convert Ensembl IDs to gene symbols using the “org.Hs.eg.db” package in R.


# Load required libraries
library(rtracklayer)

### 1. Read sample information

# Directory where your Salmon output files are stored
salmon_dir <- "/home/huzongyao1/HW/HW2/raw_data1/HW2_sf"
# Collect the quantification file paths for all samples
sf_files <- list.files(salmon_dir, pattern = "\\.sf$", full.names = TRUE)
names(sf_files) <- c("N1", "N2", "N3", "N4", "N5", "T1", "T2", "T3", "T4", "T5")
### 2. Build the transcript-ID-to-gene-ID mapping (tx2gene)

# tx_names is a vector containing all ENSEMBLTRANS IDs
tx_names <- tximport(
    sf_files,
    type = "salmon",
    txOut = TRUE
)
## reading in files with read_tsv
## 1 2 3 4 5 6 7 8 9 10
tx_names <- rownames(tx_names$abundance)
# Strip the transcript version suffix
tx_names <- sub("\\..*$", "", tx_names)

# Map transcript IDs to gene symbols
tx2gene <- bitr(
    tx_names,
    fromType = "ENSEMBLTRANS",
    toType = "SYMBOL",
    OrgDb = org.Hs.eg.db
)
## 'select()' returned 1:many mapping between keys and columns
## Warning in bitr(tx_names, fromType = "ENSEMBLTRANS", toType = "SYMBOL", :
## 89.51% of input gene IDs are fail to map...
### 3. Import the Salmon output

txi <- tximport(
    sf_files,
    type = "salmon",
    tx2gene = tx2gene,
    ignoreTxVersion = TRUE,
    countsFromAbundance = "scaledTPM"
)
## reading in files with read_tsv
## 1 2 3 4 5 6 7 8 9 10 
## removing duplicated transcript rows from tx2gene
## transcripts missing from tx2gene: 177630
## summarizing abundance
## summarizing counts
## summarizing length
# Access gene-level TPM and log-transform
tpm_matrix <- txi$abundance
log2_tpm_matrix <- log2(tpm_matrix + 1)
# Drop zero-variance genes so that prcomp(scale. = TRUE) works
gene_variances <- apply(log2_tpm_matrix, 1, var)
log2_tpm_filtered <- log2_tpm_matrix[gene_variances != 0, ]
### 4. PCA before batch-effect removal

# Read the batch information and run PCA
batch_info <- read.csv("/home/huzongyao1/HW/HW2/raw_data1/batch.csv")
pca_result <- prcomp(t(log2_tpm_filtered), scale. = TRUE)
pca_data <- as.data.frame(pca_result$x)
pca_data$X <- rownames(pca_data)
pca_data <- merge(pca_data, batch_info, by.x = "X", by.y = "X")

print(summary(pca_result))
## Importance of components:
##                            PC1     PC2     PC3      PC4      PC5      PC6
## Standard deviation     44.8812 36.3820 27.5487 26.01347 25.00449 21.87228
## Proportion of Variance  0.2876  0.1890  0.1084  0.09663  0.08928  0.06831
## Cumulative Proportion   0.2876  0.4767  0.5850  0.68165  0.77093  0.83924
##                             PC7     PC8      PC9      PC10
## Standard deviation     21.62437 19.3736 16.81753 6.634e-14
## Proportion of Variance  0.06677  0.0536  0.04039 0.000e+00
## Cumulative Proportion   0.90602  0.9596  1.00000 1.000e+00

PC1 explains 28.76% of the variance and PC2 18.90%; together the first two components explain 47.67%.

### 5. Plot the PCA results

ggplot(pca_data, aes(x = PC1, y = PC2, shape = tissue, color = as.factor(batch))) +
    geom_point(size = 3, alpha = 0.7) +
    labs(title = "PCA of Gene Expression Data", x = "PC1", y = "PC2") +
    theme_minimal()

The plot shows some separation between tumor and normal tissues along the first two principal components, but the batch effect is also evident: samples from the same batch clearly cluster together. We therefore need to adjust the data for the batch effect.

### 6. Hierarchical clustering

# Compute the distance matrix between samples
dist_matrix <- dist(t(log2_tpm_matrix))
# Perform hierarchical clustering
hc <- hclust(dist_matrix)
# Plot the dendrogram
plot(hc, labels = batch_info$X)

1.3 Problem I.3

Run COMBAT on the samples to remove the batch effect. Visualize the results using a similar PCA and hierarchical clustering as Problem 2. Provide evidence that the batch effects are successfully adjusted.


### 1. Select genes that pass a variance filter

# Set a variance threshold (adjustable); keep only highly variable genes
variance_threshold <- quantile(gene_variances, 0.75) # genes in the top 25% by variance
filtered_genes <- rownames(log2_tpm_matrix)[gene_variances > variance_threshold]
# Extract the TPM matrix for these genes
filtered_tpm_matrix <- log2_tpm_matrix[filtered_genes, ]
### 2. Remove the batch effect with ComBat

# Extract the batch labels from the batch information
batch <- as.factor(batch_info$batch)
# Run ComBat to remove the batch effect
combat_corrected <- ComBat(
    dat = filtered_tpm_matrix,
    batch = batch,
    par.prior = TRUE,
    prior.plots = FALSE
)
## Found 26 genes with uniform expression within a single batch (all zeros); these will not be adjusted for batch.
## Found2batches
## Adjusting for0covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data
### 3. PCA after batch-effect removal

# Run PCA on the ComBat-adjusted data
combat_pca_result <- prcomp(t(combat_corrected), scale. = TRUE)

# Extract the PCA results
combat_pca_data <- as.data.frame(combat_pca_result$x)
combat_pca_data$X <- rownames(combat_pca_data)
combat_pca_data <- merge(combat_pca_data, batch_info, by.x = "X", by.y = "X")

summary(combat_pca_result)
## Importance of components:
##                            PC1     PC2     PC3      PC4      PC5      PC6
## Standard deviation     36.2332 19.5800 18.2554 14.96392 14.23441 12.62325
## Proportion of Variance  0.4493  0.1312  0.1140  0.07663  0.06934  0.05453
## Cumulative Proportion   0.4493  0.5805  0.6946  0.77119  0.84053  0.89506
##                             PC7      PC8     PC9      PC10
## Standard deviation     10.90673 10.53464 8.75762 1.598e-14
## Proportion of Variance  0.04071  0.03798 0.02625 0.000e+00
## Cumulative Proportion   0.93577  0.97375 1.00000 1.000e+00

After correction, PC1 explains 44.93% of the variance and PC2 13.12%; together the first two components explain 58.05%.

### 4. Visualize the PCA results

ggplot(combat_pca_data, aes(x = PC1, y = PC2, shape = tissue, color = as.factor(batch))) +
    geom_point(size = 3) +
    labs(title = "PCA After Batch Effect Removal", x = "PC1", y = "PC2") +
    theme_minimal()

The plot shows that the batch effect has been successfully removed in the first two principal components: samples no longer cluster by batch, while tumor and normal tissues still show some separation.

### 5. Hierarchical clustering

# Compute the distance matrix between samples
combat_dist_matrix <- dist(t(combat_corrected))
# Perform hierarchical clustering
combat_hc <- hclust(combat_dist_matrix)
# Plot the dendrogram
plot(combat_hc, labels = batch_info$X, main = "Hierarchical Clustering After Batch Effect Removal")

1.4 Problem I.4

Run DESeq2 on the paired samples, adjusting for the batch effect, to identify differentially expressed genes between tumor and normal tissues. How many genes are expressed higher in tumors than in normal tissue? Let’s use 1) FDR < 0.01 and 2) Log2 fold change > 1 as the cutoffs.

Note: please use the raw_count columns of the Salmon result and convert these to integer values for DESeq2.

Identify the top 5 most overexpressed genes (by Log2FC, FDR < 0.01) in tumor and in normal, respectively.


#### DESeq2 from Salmon

# Load the Salmon results with tximport
txi_raw <- tximport(
    sf_files,
    type = "salmon",
    tx2gene = tx2gene,
    ignoreTxVersion = TRUE,
    countsFromAbundance = "no" # use raw counts (NumReads) rather than normalized TPM values
)
## reading in files with read_tsv
## 1 2 3 4 5 6 7 8 9 10 
## removing duplicated transcript rows from tx2gene
## transcripts missing from tx2gene: 177630
## summarizing abundance
## summarizing counts
## summarizing length
# Extract gene-level raw counts (NumReads) and round them to integers
raw_counts <- round(txi_raw$counts)
print(head(raw_counts, 10))
##           N1  N2   N3   N4   N5   T1   T2   T3   T4   T5
## A3GALT2    2   0    1    0    0    2    0    0   14   25
## A4GNT      0  25   20    2   11    3    0    0    1   12
## AADACL2    0   5    7    2   13    0    3    0    0    0
## AADACL4    0   0    0    1    3    0    0    0    0    0
## AARD      83 236  172   67  483   42   34    2  171  152
## AARS1P1    0   0    0    0    0    0    0    0    0    0
## AARS2    200  84  450  505  704  708  257  246  791 1122
## AARSD1   591 796  663  690  696  661 1008 1136  937  374
## AARSD1P1   8   2   25    6    0    0   12    7   22    8
## AATF     832 943 1102 1070 1024 1456 2070 3336 1582 2925
### extract differentially expressed genes

# Define the sample grouping (Normal vs Tumor)
col_data <- batch_info
# Convert tissue and batch to factors
col_data$tissue <- factor(col_data$tissue)
col_data$batch <- factor(col_data$batch)

# Build the DESeq2 object, adjusting for the batch effect
dds <- DESeqDataSetFromMatrix(
    countData = raw_counts,
    colData = col_data,
    # include batch in the design to adjust for the batch effect
    design = ~ batch + tissue
)
## converting counts to integer mode
# Keep genes with a total count greater than 1 across all samples
dds <- dds[rowSums(counts(dds)) > 1, ]
# Run the DESeq2 differential expression analysis
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
# Extract results for the Tumor vs Normal contrast
dds_res <- results(dds, contrast = c("tissue", "Tumor", "Normal"))
# Keep genes with FDR < 0.01 and |log2 fold change| > 1
res_filtered <- dds_res[which(dds_res$padj < 0.01 & abs(dds_res$log2FoldChange) > 1), ]
# Genes expressed higher in tumor than in normal tissue
upregulated_genes <- res_filtered[which(res_filtered$log2FoldChange > 1), ]
# Top 5 tumor-upregulated genes, ordered by log2 fold change
top5_upregulated <- head(upregulated_genes[order(-upregulated_genes$log2FoldChange), ], 5)

# Genes expressed lower in tumor than in normal tissue
downregulated_genes <- res_filtered[which(res_filtered$log2FoldChange < -1), ]
# Top 5 normal-upregulated genes, ordered by log2 fold change
top5_downregulated <- head(downregulated_genes[order(downregulated_genes$log2FoldChange), ], 5)

# Number of genes expressed higher in tumor than in normal tissue
cat("Number of genes upregulated in tumor:", nrow(upregulated_genes), "\n")
## Number of genes upregulated in tumor: 615
# Print the symbols and log2 fold changes of the top 5 genes
cat("Top 5 upregulated genes in tumor:\n")
## Top 5 upregulated genes in tumor:
print(top5_upregulated[, c("log2FoldChange", "padj")])
## log2 fold change (MLE): tissue Tumor vs Normal 
##  
## DataFrame with 5 rows and 2 columns
##           log2FoldChange        padj
##                <numeric>   <numeric>
## EEF1A1P44       21.19958 4.52454e-13
## SOX1             9.85882 1.63972e-27
## NGB              9.66323 1.41880e-37
## SIX6             9.55745 1.02131e-13
## ZDHHC22          9.31632 3.29360e-16
cat("\nTop 5 downregulated genes in normal tissue:\n")
## 
## Top 5 downregulated genes in normal tissue:
print(top5_downregulated[, c("log2FoldChange", "padj")])
## log2 fold change (MLE): tissue Tumor vs Normal 
##  
## DataFrame with 5 rows and 2 columns
##          log2FoldChange        padj
##               <numeric>   <numeric>
## SMIM10L1      -23.55721 1.15369e-12
## DEFA5         -23.49243 1.28811e-12
## DEFA1B        -21.92406 1.51378e-13
## DEFA1          -8.87645 4.92360e-06
## ITLN1          -8.41631 3.42619e-08
# Combine the top 5 up- and downregulated genes

# Extract the raw counts for these genes
top5_combined_counts <- raw_counts[
    rownames(raw_counts) %in% union(
        rownames(top5_upregulated),
        rownames(top5_downregulated)
    ),
]

# Average expression in tumor and normal tissues
avg_counts <- data.frame(
    gene = rownames(top5_combined_counts),
    normal = rowMeans(top5_combined_counts[, col_data$tissue == "Normal"]),
    tumor = rowMeans(top5_combined_counts[, col_data$tissue == "Tumor"])
)

# Reshape the data to long format
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:ggtree':
## 
##     expand
## The following object is masked from 'package:S4Vectors':
## 
##     expand
avg_counts_long <- avg_counts %>%
    pivot_longer(
        cols = c(normal, tumor),
        names_to = "tissue",
        values_to = "count"
    )

# Bar plot of the top up- and downregulated genes
ggplot(avg_counts_long, aes(x = reorder(gene, -count), y = count, fill = tissue)) +
    geom_bar(stat = "identity", position = "dodge") +
    labs(title = "Top 5 Up- and Downregulated Genes in Tumor vs Normal", x = "Gene", y = "Average Count") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

1.5 Problem I.5

Visualize the differential gene expression values by making a volcano and an MA plot to summarize the differences between tumor and normal. In the volcano plot, draw the statistically significant (FDR < 0.01, Log2FC > 1) tumor up-genes red and down-genes blue.

Note: Be sure to use the lfcShrink function to get more robust estimates of the fold-changes for genes.


### lfcShrink to get more robust estimates of fold changes for genes

# Shrink the log2 fold changes with apeglm for more robust estimates
res_shrunken <- lfcShrink(dds, coef = "tissue_Tumor_vs_Normal", type = "apeglm")
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
##     sequence count data: removing the noise and preserving large differences.
##     Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
### Volcano plot

# Convert to a data frame for ggplot2
res_shrunken_df <- as.data.frame(res_shrunken)

# Color labels: red for upregulated, blue for downregulated, grey for non-significant genes
res_shrunken_df$color <- "Non-significant"
res_shrunken_df$color[res_shrunken_df$padj < 0.01 & res_shrunken_df$log2FoldChange > 1] <- "Up Regulated Genes"
res_shrunken_df$color[res_shrunken_df$padj < 0.01 & res_shrunken_df$log2FoldChange < -1] <- "Down Regulated Genes"

# Draw the volcano plot
ggplot(res_shrunken_df, aes(x = log2FoldChange, y = -log10(padj), color = color)) +
    geom_point(alpha = 0.8, size = 1.5) +
    scale_color_manual(values = c(
        "Down Regulated Genes" = "blue",
        "Up Regulated Genes" = "red",
        "Non-significant" = "grey"
    )) +
    labs(title = "Volcano Plot", x = "Log2 Fold Change", y = "-Log10 Adjusted P-value") +
    theme_minimal()
## Warning: Removed 1672 rows containing missing values or values outside the scale range
## (`geom_point()`).

# Draw the MA plot
plotMA(res_shrunken, ylim = c(-5, 5), main = "MA Plot")

1.6 Problem I.6

Try kmeans (try k = 4 or 7) clustering to group differentially expressed genes into different clusters. How many genes are there in each cluster? Draw a heatmap showing gene clusters.


# kmeans for differential genes

# Expression matrix of the differentially expressed genes
gene_data <- assay(dds)[rownames(res_filtered), ]
# Standardize the expression data (row-wise scaling)
gene_data_scaled <- t(scale(t(gene_data)))
# Number of clusters
k <- 7
# Perform k-means clustering
set.seed(114514) # set a random seed for reproducible results
kmeans_result <- kmeans(gene_data_scaled, centers = k)
# summarize gene number in each cluster

# Number of genes in each cluster
table(kmeans_result$cluster)
## 
##   1   2   3   4   5   6   7 
## 213 102  97 155 336 150  97
# heatmap showing genes with kmeans clustering

# Reorder the genes by their k-means cluster
gene_data_ordered <- gene_data_scaled[order(kmeans_result$cluster), ]
# Draw the heatmap with pheatmap, annotating cluster membership
pheatmap(gene_data_ordered,
    cluster_rows = FALSE, cluster_cols = TRUE,
    annotation_row = data.frame(Cluster = factor(kmeans_result$cluster)),
    show_rownames = FALSE, show_colnames = TRUE,
    main = paste("Heatmap of Differentially Expressed Genes (k =", k, ")")
)
## Warning: Row annotation has different order from matrix rows. Adjust the row
## annotation based on row names of the matrix.

1.7 Problem I.7: For graduate students only

If you run DESeq2 without removing the batch effect, how many differential genes do you get? How does their K-means clustering look? Does batch effect removal give more interpretable results?


Before running DESeq2 without batch-effect removal, however, I first want to correct the batch effect with limma::removeBatchEffect and compare the two approaches.

library(limma)
## 
## Attaching package: 'limma'
## The following object is masked from 'package:DESeq2':
## 
##     plotMA
## The following object is masked from 'package:BiocGenerics':
## 
##     plotMA
library(stats)

# Correct the batch effect with limma::removeBatchEffect
adjusted_gene_data <- removeBatchEffect(gene_data, batch = col_data$batch)

# Run PCA
adjusted_pca_res <- prcomp(t(adjusted_gene_data), scale. = TRUE)

# Extract the principal component scores
adjusted_pca_data <- as.data.frame(adjusted_pca_res$x)
adjusted_pca_data$X <- rownames(adjusted_pca_data)
adjusted_pca_data <- merge(adjusted_pca_data, col_data, by.x = "X", by.y = "X")

print(summary(adjusted_pca_res))
## Importance of components:
##                            PC1      PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     26.5981 10.20544 9.02130 8.59956 7.98769 6.57239 6.31378
## Proportion of Variance  0.6152  0.09057 0.07077 0.06431 0.05548 0.03756 0.03466
## Cumulative Proportion   0.6152  0.70575 0.77651 0.84082 0.89630 0.93386 0.96853
##                            PC8       PC9      PC10
## Standard deviation     6.01597 4.967e-15 3.865e-15
## Proportion of Variance 0.03147 0.000e+00 0.000e+00
## Cumulative Proportion  1.00000 1.000e+00 1.000e+00
# Plot the PCA results
ggplot(adjusted_pca_data, aes(x = PC1, y = PC2, shape = tissue, color = as.factor(batch))) +
    geom_point(size = 3, alpha = 0.7) +
    labs(title = "PCA of Gene Expression Data", x = "PC1", y = "PC2") +
    theme_minimal()

# K-means clustering on the corrected data

# Standardize the expression data (row-wise scaling)
adjusted_gene_data_scaled <- t(scale(t(adjusted_gene_data)))
# Number of clusters
k <- 7
# Perform k-means clustering
set.seed(114514) # set a random seed for reproducible results
adjusted_kmeans_result <- kmeans(adjusted_gene_data_scaled, centers = k)
# Number of genes in each cluster
table(adjusted_kmeans_result$cluster)
## 
##   1   2   3   4   5   6   7 
## 210 135 111 153 155 141 245
# Reorder the genes by their k-means cluster
adjusted_gene_data_ordered <- adjusted_gene_data_scaled[order(adjusted_kmeans_result$cluster), ]
# Draw the heatmap, annotating cluster membership
pheatmap(adjusted_gene_data_ordered,
    cluster_rows = FALSE, cluster_cols = TRUE,
    annotation_row = data.frame(Cluster = factor(adjusted_kmeans_result$cluster)),
    show_rownames = FALSE, show_colnames = TRUE,
    main = paste("[limma adjusted] Heatmap of Differentially Expressed Genes (k =", k, ")")
)
## Warning: Row annotation has different order from matrix rows. Adjust the row
## annotation based on row names of the matrix.

#### DESeq2 from Salmon without batch removal

# Build the DESeq2 object (no batch term in the design)
batched_dds <- DESeqDataSetFromMatrix(
    countData = raw_counts,
    colData = col_data,
    design = ~tissue
)
## converting counts to integer mode
# Keep genes with a total count greater than 1 across all samples
batched_dds <- batched_dds[rowSums(counts(batched_dds)) > 1, ]
# Run the DESeq2 differential expression analysis
batched_dds <- DESeq(batched_dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
### extract differentially expressed genes

# Extract results for the Tumor vs Normal contrast
batched_dds_res <- results(batched_dds, contrast = c("tissue", "Tumor", "Normal"))
# Keep genes with FDR < 0.01 and |log2 fold change| > 1
batched_res_filtered <- batched_dds_res[which(batched_dds_res$padj < 0.01 & abs(batched_dds_res$log2FoldChange) > 1), ]
# kmeans for differential genes

# Expression matrix of the differentially expressed genes
batched_gene_data <- assay(batched_dds)[rownames(batched_res_filtered), ]
# Standardize the expression data (row-wise scaling)
batched_gene_data_scaled <- t(scale(t(batched_gene_data)))
# Number of clusters
k <- 7
# Perform k-means clustering
set.seed(114514) # set a random seed for reproducible results
batched_kmeans_result <- kmeans(batched_gene_data_scaled, centers = k)

# summarize gene number in each cluster

# Number of genes in each cluster
table(batched_kmeans_result$cluster)
## 
##   1   2   3   4   5   6   7 
## 193  70 176 152 100 214 167
# heatmap showing genes with kmeans clustering

# Reorder the genes by their k-means cluster
batched_gene_data_ordered <- batched_gene_data_scaled[order(batched_kmeans_result$cluster), ]
# Draw the heatmap, annotating cluster membership
pheatmap(batched_gene_data_ordered,
    cluster_rows = FALSE, cluster_cols = TRUE,
    annotation_row = data.frame(Cluster = factor(batched_kmeans_result$cluster)),
    show_rownames = FALSE, show_colnames = TRUE,
    main = paste("[batch] Heatmap of Differentially Expressed Genes (k =", k, ")")
)
## Warning: Row annotation has different order from matrix rows. Adjust the row
## annotation based on row names of the matrix.

1.8 Problem I.8

From the batch-removed DESeq2 run results, extract the top 200 tumor-upregulated genes (by Log2FC, FDR < 0.01). Run DAVID GO analysis (http://david.abcc.ncifcrf.gov/) to see whether these genes are enriched in specific biological process (BP), pathways, etc.


## extract top 200 genes from up-regulated genes with batch removal
top200_upregulated <- head(upregulated_genes[order(-upregulated_genes$log2FoldChange), ], 200)

print(top200_upregulated)
## log2 fold change (MLE): tissue Tumor vs Normal 
## Wald test p-value: tissue Tumor vs Normal 
## DataFrame with 200 rows and 6 columns
##             baseMean log2FoldChange     lfcSE      stat      pvalue        padj
##            <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
## EEF1A1P44    14.2277       21.19958  2.747476   7.71602 1.20019e-14 4.52454e-13
## SOX1       1525.8448        9.85882  0.869075  11.34404 7.93971e-30 1.63972e-27
## NGB         721.3538        9.66323  0.726143  13.30760 2.09087e-40 1.41880e-37
## SIX6        128.9556        9.55745  1.207179   7.91718 2.42965e-15 1.02131e-13
## ZDHHC22     422.2062        9.31632  1.077728   8.64441 5.40843e-18 3.29360e-16
## ...              ...            ...       ...       ...         ...         ...
## C16orf92     2.80940        4.64746  1.405005   3.30779 9.40366e-04 4.23388e-03
## BRINP2      21.08145        4.64689  1.103059   4.21273 2.52302e-05 1.78073e-04
## FOXO6     1051.23834        4.63786  0.540185   8.58570 9.02880e-18 5.04551e-16
## C1QL2       96.67307        4.62456  0.909316   5.08575 3.66169e-07 3.81426e-06
## TMPOP2       8.62039        4.60181  1.288077   3.57262 3.53432e-04 1.82876e-03
# Map SYMBOL back to ENSEMBL with org.Hs.eg.db
top200_gene_ids <- mapIds(
    org.Hs.eg.db,
    keys = rownames(top200_upregulated),
    column = "ENSEMBL",
    keytype = "SYMBOL",
    multiVals = "first"
)
## 'select()' returned 1:many mapping between keys and columns
## write gene list to file

write.table(
    rownames(top200_upregulated),
    file = "/home/huzongyao1/Homework_2020/HW2/output/top200_upregulated_gene_symbol.txt",
    quote = FALSE,
    row.names = FALSE,
    col.names = FALSE
)

write.table(
    top200_gene_ids,
    file = "/home/huzongyao1/Homework_2020/HW2/output/top200_upregulated_genes.txt",
    quote = FALSE,
    row.names = FALSE,
    col.names = FALSE
)

Upload top200_upregulated_gene_symbol.txt to https://david.ncifcrf.gov/tools.jsp, choose Official Gene Symbol as the identifier, Homo sapiens as the species, and Gene List as the list type, then click Submit List and inspect the results:

  • Functional Annotation Clustering returned 25 clusters; the top-ranked one is the most strongly enriched (Annotation Cluster 1, Enrichment Score: 6.37).

  • Functional Annotation Chart generated 261 chart records.

  • Functional Annotation Table generated 191 records.
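As a scriptable alternative to the DAVID web interface, a similar GO biological-process enrichment can be sketched with clusterProfiler (already loaded above), reusing the top-200 symbol list; results will not match DAVID's exactly, since the backend annotations differ:

```r
# GO-BP enrichment of the top-200 tumor-upregulated gene symbols
go_bp <- enrichGO(
    gene          = rownames(top200_upregulated),
    OrgDb         = org.Hs.eg.db,
    keyType       = "SYMBOL",
    ont           = "BP", # biological process
    pAdjustMethod = "BH",
    qvalueCutoff  = 0.05
)
head(as.data.frame(go_bp))
```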

1.9 Problem I.9: For graduate students only

Run Gene Set Enrichment Analysis (http://www.broadinstitute.org/gsea/index.jsp) using the summary statistics from Problem 4. Show the top five gene sets or pathways that best capture the differentially expressed genes between tumor and normal. Comment on the biological relevance of the results. Plot a GSEA enrichment plot for an interesting pathway.

The mapping between gene sets and pathways is provided in the /n/stat115/2021/HW2/raw_data1/c2.cp.kegg.v7.4.symbols.gmt file.


library(enrichplot)

### GSEA

## prepare pre-ranked gene list

# Build the ranked gene list of log2 fold-change values (it must contain all genes, not just the significant ones)
gene_list <- dds_res$log2FoldChange
names(gene_list) <- rownames(dds_res)
gene_list <- sort(gene_list, decreasing = TRUE)

## read in gmt file

# Load the KEGG gene-set file
kegg_genesets <- read.gmt("/home/huzongyao1/HW/HW2/raw_data1/c2.cp.kegg.v7.4.symbols.gmt")

## GSEA

# Run GSEA
gsea_result <- GSEA(gene_list, TERM2GENE = kegg_genesets, pvalueCutoff = 0.05)
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (9.23% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
## leading edge analysis...
## done...
## extract significant pathways

# Show the top five significantly enriched gene sets
top5_pathways <- head(gsea_result@result[order(gsea_result@result$p.adjust), ], 5)
print(top5_pathways)
##                                                                                                  ID
## KEGG_SYSTEMIC_LUPUS_ERYTHEMATOSUS                                 KEGG_SYSTEMIC_LUPUS_ERYTHEMATOSUS
## KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION             KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION
## KEGG_DRUG_METABOLISM_CYTOCHROME_P450                           KEGG_DRUG_METABOLISM_CYTOCHROME_P450
## KEGG_RETINOL_METABOLISM                                                     KEGG_RETINOL_METABOLISM
## KEGG_METABOLISM_OF_XENOBIOTICS_BY_CYTOCHROME_P450 KEGG_METABOLISM_OF_XENOBIOTICS_BY_CYTOCHROME_P450
##                                                                                         Description
## KEGG_SYSTEMIC_LUPUS_ERYTHEMATOSUS                                 KEGG_SYSTEMIC_LUPUS_ERYTHEMATOSUS
## KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION             KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION
## KEGG_DRUG_METABOLISM_CYTOCHROME_P450                           KEGG_DRUG_METABOLISM_CYTOCHROME_P450
## KEGG_RETINOL_METABOLISM                                                     KEGG_RETINOL_METABOLISM
## KEGG_METABOLISM_OF_XENOBIOTICS_BY_CYTOCHROME_P450 KEGG_METABOLISM_OF_XENOBIOTICS_BY_CYTOCHROME_P450
##                                                   setSize enrichmentScore
## KEGG_SYSTEMIC_LUPUS_ERYTHEMATOSUS                      75       0.6355435
## KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION            98      -0.4501634
## KEGG_DRUG_METABOLISM_CYTOCHROME_P450                   14      -0.8098617
## KEGG_RETINOL_METABOLISM                                16      -0.7607557
## KEGG_METABOLISM_OF_XENOBIOTICS_BY_CYTOCHROME_P450      14      -0.7827067
##                                                         NES       pvalue
## KEGG_SYSTEMIC_LUPUS_ERYTHEMATOSUS                  2.250460 1.298355e-08
## KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION       -1.868671 3.256600e-05
## KEGG_DRUG_METABOLISM_CYTOCHROME_P450              -2.164787 5.460117e-05
## KEGG_RETINOL_METABOLISM                           -2.130939 1.269299e-04
## KEGG_METABOLISM_OF_XENOBIOTICS_BY_CYTOCHROME_P450 -2.092201 1.983719e-04
##                                                       p.adjust       qvalue
## KEGG_SYSTEMIC_LUPUS_ERYTHEMATOSUS                 1.090618e-06 8.336805e-07
## KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION       1.367772e-03 1.045540e-03
## KEGG_DRUG_METABOLISM_CYTOCHROME_P450              1.528833e-03 1.168657e-03
## KEGG_RETINOL_METABOLISM                           2.665528e-03 2.037559e-03
## KEGG_METABOLISM_OF_XENOBIOTICS_BY_CYTOCHROME_P450 3.332648e-03 2.547513e-03
##                                                   rank
## KEGG_SYSTEMIC_LUPUS_ERYTHEMATOSUS                 1029
## KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION       1101
## KEGG_DRUG_METABOLISM_CYTOCHROME_P450               496
## KEGG_RETINOL_METABOLISM                            923
## KEGG_METABOLISM_OF_XENOBIOTICS_BY_CYTOCHROME_P450  496
##                                                                     leading_edge
## KEGG_SYSTEMIC_LUPUS_ERYTHEMATOSUS                 tags=59%, list=16%, signal=50%
## KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION       tags=38%, list=17%, signal=32%
## KEGG_DRUG_METABOLISM_CYTOCHROME_P450               tags=64%, list=8%, signal=59%
## KEGG_RETINOL_METABOLISM                           tags=56%, list=14%, signal=48%
## KEGG_METABOLISM_OF_XENOBIOTICS_BY_CYTOCHROME_P450  tags=57%, list=8%, signal=53%
##                                                                                                                                                                                                                                                                                                          core_enrichment
## KEGG_SYSTEMIC_LUPUS_ERYTHEMATOSUS                 H2BC17/H2AC16/H2AC14/H2BC13/H2AC12/H2BC9/H3C7/H3C3/H3C15/H2BC7/H3C11/H2AC21/H3C8/H2AC13/H2BC14/H4C4/H4C13/H2AC7/H2AC4/H3C12/H4C2/H2AC17/H3C1/H4C1/H2AC11/H4C9/H2BC6/H3C13/H3C2/H2AC8/H2AX/H2AC15/H3C4/H3C10/H3C6/H4C14/H4C11/H2BC11/H2BC8/H2BC3/H4C5/H2BC10/ACTN3/H4C3
## KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION                                                           XCL2/CCL24/CXCL1/OSM/CCL22/TNF/IL10/IFNK/CX3CL1/LEP/TNFRSF10D/TNFRSF1A/GH1/IFNA21/LIF/IFNA1/CTF1/CCL5/CCL13/BMP2/IL12A/CCL18/TNFSF9/CCL14/CXCR1/CSF2/CXCL3/CCL16/IL1A/PF4/CXCR2/CXCL2/CCL15/CCL23/PPBP/VEGFD/CXCL5
## KEGG_DRUG_METABOLISM_CYTOCHROME_P450                                                                                                                                                                                                                         UGT1A1/ADH1A/ADH1C/UGT1A4/CYP2A13/CYP2A6/GSTA1/CYP1A2/GSTA2
## KEGG_RETINOL_METABOLISM                                                                                                                                                                                                                                     UGT1A8/DHRS3/UGT1A1/ADH1A/ADH1C/UGT1A4/CYP2A13/CYP2A6/CYP1A2
## KEGG_METABOLISM_OF_XENOBIOTICS_BY_CYTOCHROME_P450                                                                                                                                                                                                                    UGT1A1/ADH1A/ADH1C/UGT1A4/GSTA1/CYP1A2/GSTA2/CYP2F1

Comments on the biological relevance of the results (from ChatGPT)

Based on the gene set enrichment analysis results:

  1. KEGG_SYSTEMIC_LUPUS_ERYTHEMATOSUS (systemic lupus erythematosus): the positive enrichment score (0.6355435) indicates that this gene set may be upregulated in the data. Genes associated with systemic lupus erythematosus appear active in these samples, possibly reflecting an immune response.
  2. KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION (cytokine-cytokine receptor interaction): the negative enrichment score (-0.4501634) indicates that genes in this pathway trend downward. Cytokines are central to immune regulation and inflammatory responses, so this downregulation may reflect suppressed or reduced cytokine-mediated signaling in these samples.
  3. KEGG_DRUG_METABOLISM_CYTOCHROME_P450 (drug metabolism by cytochrome P450): the strongly negative enrichment score (-0.8098617) indicates pronounced downregulation of the drug-metabolism gene set. Cytochrome P450 enzymes play a major role in drug metabolism and toxin clearance, which may imply low drug-metabolizing activity in these samples.
  4. KEGG_RETINOL_METABOLISM (retinol metabolism): the negative enrichment score (-0.7607557) means that retinol-metabolism genes are downregulated. Retinol (vitamin A) is involved in cell differentiation, vision, and immune function, so this result may point to reduced retinol metabolism.
  5. KEGG_METABOLISM_OF_XENOBIOTICS_BY_CYTOCHROME_P450 (xenobiotic metabolism by cytochrome P450): the enrichment score of -0.7827067, similar to drug metabolism, indicates downregulation of xenobiotic-metabolism genes. This may reflect reduced detoxification capacity or metabolic load in these samples.

Overall assessment

  • The up- and downregulation of immune-related pathways (systemic lupus erythematosus, cytokine-receptor interaction) may point to abnormal immune activity in these samples.
  • The downregulation of metabolism-related pathways (drug metabolism, retinol metabolism, xenobiotic metabolism) suggests that metabolic activity or detoxification function is affected.

If these samples were obtained from tumor and normal tissue, the results may indicate tumor-associated changes in certain metabolic and immune functions, for example suppressed metabolism and immune dysregulation.

## Ridge plot: distribution of fold changes of the enriched genes
ridgeplot(gsea_result) + ggtitle("Ridge Plot of Top Gene Sets")
## Picking joint bandwidth of 0.645

## logFC distribution of the gene sets
gseadist(gsea_result, gsea_result@result$ID) +
    ggtitle("GSEA Distribution of Top Gene Sets")

## plot GSEA enrichment plot for an interesting pathway

id <- 1 # systemic lupus erythematosus pathway

## enrichment of the gene set along the ranked gene list
gsearank(gsea_result, geneSetID = id) +
    ggtitle("GSEA Rank Plot of Top Gene Sets")

# draw the GSEA enrichment plot
gseaplot2(
    gsea_result,
    # the pathway selected via id above
    geneSetID = top5_pathways$ID[id],
    title = top5_pathways$Description[id]
)

2 Part II: Sample classification

We provide you z-score normalized expression data of 50 breast tumor samples, 50 normal breast samples (your training and cross-validation data), and 20 samples without diagnosis (your testing data). We want to use the 100 samples with known diagnosis to train machine learning models in order to predict the 20 unknown samples.

You will need the following libraries in R: ggplot2 and ggfortify for plotting, MASS and caret for machine learning, and pROC is for evaluating testing performance. The YouTube video on caret (https://youtu.be/z8PRU46I3NY) and the package documentation (http://topepo.github.io/caret/index.html) might be helpful.

All data for Part II are provided at /n/stat115/2021/HW2/raw_data2.

library(ggplot2)
library(ggfortify)
library(pROC)
library(caret)
library(MASS)

#### read in data

zscore_data <- read.table("/home/huzongyao1/HW/HW2/raw_data2/BRCA_zscore_data.txt", header = TRUE, row.names = 1)
phenotype <- read.table("/home/huzongyao1/HW/HW2/raw_data2/BRCA_phenotype.txt", header = TRUE)
diagnosis <- read.table("/home/huzongyao1/HW/HW2/raw_data2/diagnosis.txt", header = TRUE)
unknown_samples <- read.table("/home/huzongyao1/HW/HW2/raw_data2/unknown_samples.txt", header = TRUE)

# str: structure of the data
str(zscore_data)
## 'data.frame':    100 obs. of  20501 variables:
##  $ A1BG           : num  -0.161 -0.155 -0.219 -0.128 -0.139 ...
##  $ A1CF           : num  -0.199 -0.158 -0.222 -0.133 -0.157 ...
##  $ A2BP1          : num  -0.199 -0.158 -0.222 -0.133 -0.157 ...
##  $ A2LD1          : num  -0.178 -0.138 -0.196 -0.123 -0.138 ...
##  $ A2M            : num  0.404 3.514 0.523 2.695 2.251 ...
##  $ A2ML1          : num  -0.199 -0.158 -0.218 -0.131 -0.157 ...
##  $ A4GALT         : num  -0.171 -0.1 -0.202 -0.1 -0.13 ...
##  $ A4GNT          : num  -0.199 -0.157 -0.222 -0.133 -0.156 ...
##  $ AAA1           : num  -0.199 -0.158 -0.222 -0.133 -0.157 ...
##  $ AAAS           : num  -0.0137 -0.0649 -0.0433 -0.0681 -0.0585 ...
##  $ AACS           : num  -0.08333 0.00541 0.07548 -0.05302 -0.09103 ...
##  $ AACSL          : num  -0.199 -0.158 -0.222 -0.133 -0.157 ...
##  $ AADAC          : num  -0.1992 -0.0687 -0.2216 -0.0979 -0.1567 ...
##  $ AADACL2        : num  -0.199 -0.158 -0.222 -0.132 -0.157 ...
##  $ AADACL3        : num  -0.199 -0.158 -0.222 -0.133 -0.157 ...
##  $ AADACL4        : num  -0.199 -0.158 -0.222 -0.133 -0.157 ...
##  $ AADAT          : num  -0.197 -0.146 -0.218 -0.122 -0.153 ...
##  $ AAGAB          : num  0.1694 -0.0807 0.1862 -0.047 0.0705 ...
##  $ AAK1           : num  -0.1298 -0.0902 -0.167 -0.104 -0.1064 ...
##  $ AAMP           : num  0.426 0.109 0.709 0.02 0.197 ...
##  $ AANAT          : num  -0.199 -0.158 -0.222 -0.132 -0.156 ...
##  $ AARS           : num  0.0052 0.0747 0.479 0.0191 0.0955 ...
##  $ AARS2          : num  -0.159 -0.147 -0.198 -0.121 -0.135 ...
##  $ AARSD1         : num  0.0225 -0.0645 -0.1196 -0.0567 0.0716 ...
##  $ AASDH          : num  -0.131 -0.124 -0.163 -0.101 -0.127 ...
##  $ AASDHPPT       : num  -0.0682 -0.0557 0.0254 -0.0547 -0.084 ...
##  $ AASS           : num  -0.1852 -0.0242 -0.2078 -0.0923 -0.1402 ...
##  $ AATF           : num  0.0392 -0.0558 -0.0281 -0.0402 -0.0294 ...
##  $ AATK           : num  -0.195 -0.157 -0.22 -0.131 -0.155 ...
##  $ ABAT           : num  0.2438 -0.1276 -0.1955 -0.1105 -0.0944 ...
##  $ ABCA1          : num  -0.1849 0.0482 -0.1967 -0.0902 -0.1341 ...
##  $ ABCA10         : num  -0.1984 -0.0847 -0.2219 -0.0992 -0.1542 ...
##  $ ABCA11P        : num  -0.162 -0.13 -0.192 -0.111 -0.139 ...
##  $ ABCA12         : num  -0.1978 -0.158 0.0492 -0.1317 -0.1112 ...
##  $ ABCA13         : num  -0.199 -0.158 -0.222 -0.131 -0.156 ...
##  $ ABCA17P        : num  -0.196 -0.157 -0.219 -0.132 -0.156 ...
##  $ ABCA2          : num  -0.1038 -0.0917 -0.0376 -0.1037 -0.0933 ...
##  $ ABCA3          : num  0.00671 -0.09309 -0.1419 -0.09832 -0.1105 ...
##  $ ABCA4          : num  -0.198 -0.157 -0.221 -0.124 -0.151 ...
##  $ ABCA5          : num  -0.191 -0.112 -0.205 -0.1 -0.146 ...
##  $ ABCA6          : num  -0.1946 -0.027 -0.2196 -0.0803 -0.1456 ...
##  $ ABCA7          : num  -0.131 -0.142 -0.191 -0.125 -0.126 ...
##  $ ABCA8          : num  -0.1964 0.2299 -0.2219 -0.0195 -0.1491 ...
##  $ ABCA9          : num  -0.19804 0.00115 -0.22174 -0.01253 -0.15086 ...
##  $ ABCB1          : num  -0.197 -0.146 -0.221 -0.115 -0.132 ...
##  $ ABCB10         : num  -0.105 -0.111 -0.166 -0.106 -0.11 ...
##  $ ABCB11         : num  -0.199 -0.151 -0.222 -0.133 -0.156 ...
##  $ ABCB4          : num  -0.197 -0.156 -0.22 -0.13 -0.155 ...
##  $ ABCB5          : num  -0.199 -0.146 -0.222 -0.109 -0.155 ...
##  $ ABCB6          : num  -0.146 -0.104 -0.142 -0.107 -0.12 ...
##  $ ABCB7          : num  -0.095 -0.0466 -0.1066 -0.0631 -0.0865 ...
##  $ ABCB8          : num  -0.132 -0.121 -0.061 -0.116 -0.101 ...
##  $ ABCB9          : num  -0.187 -0.153 -0.184 -0.126 -0.149 ...
##  $ ABCC1          : num  -0.1321 -0.1172 -0.1673 -0.0889 -0.1157 ...
##  $ ABCC10         : num  -0.171 -0.145 -0.193 -0.122 -0.138 ...
##  $ ABCC11         : num  -0.1283 -0.1565 1.7186 -0.0977 0.0901 ...
##  $ ABCC12         : num  -0.199 -0.158 -0.221 -0.133 -0.155 ...
##  $ ABCC13         : num  -0.199 -0.158 -0.222 -0.133 -0.157 ...
##  $ ABCC2          : num  -0.198 -0.155 -0.179 -0.132 -0.153 ...
##  $ ABCC3          : num  -0.174 -0.153 0.218 -0.104 -0.129 ...
##  $ ABCC4          : num  -0.19 -0.148 -0.155 -0.121 -0.139 ...
##  $ ABCC5          : num  -0.11476 -0.11086 0.00429 -0.08106 -0.1193 ...
##  $ ABCC6          : num  -0.1788 -0.0562 -0.199 -0.1232 -0.1543 ...
##  $ ABCC6P1        : num  -0.199 -0.154 -0.222 -0.133 -0.157 ...
##  $ ABCC6P2        : num  -0.182 -0.135 -0.201 -0.128 -0.151 ...
##  $ ABCC8          : num  -0.173 -0.158 -0.215 -0.132 -0.154 ...
##  $ ABCC9          : num  -0.192 0.1154 -0.1966 -0.0754 -0.1459 ...
##  $ ABCD1          : num  -0.1505 -0.0902 -0.1359 -0.1176 -0.1339 ...
##  $ ABCD2          : num  -0.199 0.029 -0.222 -0.127 -0.153 ...
##  $ ABCD3          : num  0.5876 -0.045 0.0518 -0.0398 0.1582 ...
##  $ ABCD4          : num  -0.1211 -0.0577 -0.1601 -0.0766 -0.0994 ...
##  $ ABCE1          : num  -0.027117 -0.058306 0.000237 -0.023469 -0.034776 ...
##  $ ABCF1          : num  0.0692 -0.0365 0.0503 -0.0449 0.0122 ...
##  $ ABCF2          : num  -0.09817 -0.07499 0.00521 -0.0469 -0.01712 ...
##  $ ABCF3          : num  0.00912 -0.06507 0.05667 -0.06629 -0.02372 ...
##  $ ABCG1          : num  -0.05843 -0.10104 -0.06015 -0.08155 0.00854 ...
##  $ ABCG2          : num  -0.194 -0.141 -0.218 -0.109 -0.144 ...
##  $ ABCG4          : num  -0.199 -0.158 -0.222 -0.133 -0.156 ...
##  $ ABCG5          : num  -0.198 -0.157 -0.222 -0.132 -0.157 ...
##  $ ABCG8          : num  -0.199 -0.156 -0.222 -0.133 -0.157 ...
##  $ ABHD1          : num  -0.185 -0.129 -0.219 -0.128 -0.151 ...
##  $ ABHD10         : num  -0.0615 -0.0821 -0.0651 -0.0569 -0.0889 ...
##  $ ABHD11         : num  0.2164 -0.1138 0.4513 -0.0728 0.0789 ...
##  $ ABHD12         : num  0.1153 -0.03582 0.12486 -0.028 0.00528 ...
##  $ ABHD12B        : num  -0.195 -0.155 -0.221 -0.132 -0.154 ...
##  $ ABHD13         : num  -0.161 -0.121 -0.175 -0.104 -0.119 ...
##  $ ABHD14A        : num  -0.0867 -0.0306 -0.0598 -0.0436 -0.0989 ...
##  $ ABHD14B        : num  0.1935 0.6735 0.3657 0.3112 0.0918 ...
##  $ ABHD15         : num  -0.1028 0.1831 -0.1039 -0.0999 -0.1128 ...
##  $ ABHD2          : num  0.000624 -0.076385 -0.083781 -0.018347 0.608161 ...
##  $ ABHD3          : num  0.27133 -0.06029 -0.13749 -0.1025 -0.00756 ...
##  $ ABHD4          : num  0.00962 -0.02106 -0.0368 -0.02489 -0.08412 ...
##  $ ABHD5          : num  -0.1259 0.1612 -0.1522 -0.0755 -0.1188 ...
##  $ ABHD6          : num  -0.1698 -0.0595 -0.205 -0.1087 -0.1336 ...
##  $ ABHD8          : num  -0.153 -0.136 -0.213 -0.112 -0.126 ...
##  $ ABI1           : num  -0.04009 -0.04579 0.00116 -0.00813 -0.02277 ...
##  $ ABI2           : num  -0.1113 -0.1125 -0.1295 -0.0391 -0.0759 ...
##  $ ABI3           : num  -0.174 -0.149 -0.2 -0.116 -0.107 ...
##  $ ABI3BP         : num  -0.1905 0.0569 -0.2169 0.2513 -0.0687 ...
##   [list output truncated]
str(phenotype)
## 'data.frame':    100 obs. of  2 variables:
##  $ sample   : chr  "TCGA-A7-A13G-01" "TCGA-A7-A13G-11" "TCGA-AC-A23H-01" "TCGA-AC-A23H-11" ...
##  $ phenotype: chr  "Tumor" "Normal" "Tumor" "Normal" ...
str(diagnosis)
## 'data.frame':    20 obs. of  2 variables:
##  $ sample   : chr  "Test1" "Test2" "Test3" "Test4" ...
##  $ phenotype: chr  "Normal" "Tumor" "Tumor" "Normal" ...
str(unknown_samples)
## 'data.frame':    20 obs. of  20501 variables:
##  $ A1BG           : num  -0.169 -0.148 -0.106 -0.185 -0.163 ...
##  $ A1CF           : num  -0.184 -0.158 -0.128 -0.196 -0.206 ...
##  $ A2BP1          : num  -0.183 -0.158 -0.128 -0.196 -0.206 ...
##  $ A2LD1          : num  -0.165 -0.146 -0.121 -0.181 -0.179 ...
##  $ A2M            : num  2.86 2.41 0.49 3.99 1.96 ...
##  $ A2ML1          : num  -0.184 -0.158 -0.12 -0.185 -0.206 ...
##  $ A4GALT         : num  -0.1497 -0.1223 -0.0989 -0.1509 -0.1251 ...
##  $ A4GNT          : num  -0.184 -0.156 -0.127 -0.196 -0.206 ...
##  $ AAA1           : num  -0.184 -0.158 -0.128 -0.196 -0.206 ...
##  $ AAAS           : num  0.02756 -0.03786 -0.01113 -0.07162 0.00615 ...
##  $ AACS           : num  -0.0733 -0.0753 -0.0818 -0.0461 -0.0736 ...
##  $ AACSL          : num  -0.184 -0.158 -0.128 -0.196 -0.206 ...
##  $ AADAC          : num  -0.169 -0.122 -0.127 -0.193 -0.206 ...
##  $ AADACL2        : num  -0.184 -0.155 -0.128 -0.196 -0.206 ...
##  $ AADACL3        : num  -0.183 -0.158 -0.128 -0.196 -0.206 ...
##  $ AADACL4        : num  -0.184 -0.158 -0.128 -0.196 -0.206 ...
##  $ AADAT          : num  -0.174 -0.146 -0.125 -0.159 -0.192 ...
##  $ AAGAB          : num  0.13071 0.01078 0.00651 -0.0076 0.04112 ...
##  $ AAK1           : num  -0.1061 -0.1055 -0.0762 -0.1196 -0.1744 ...
##  $ AAMP           : num  0.216 0.146 0.138 0.109 0.477 ...
##  $ AANAT          : num  -0.184 -0.158 -0.128 -0.196 -0.206 ...
##  $ AARS           : num  0.0819 0.0841 0.0847 0.0767 0.7381 ...
##  $ AARS2          : num  -0.158 -0.134 -0.1 -0.169 -0.166 ...
##  $ AARSD1         : num  -0.087579 -0.061706 0.000496 -0.06312 0.308675 ...
##  $ AASDH          : num  -0.127 -0.1112 -0.0986 -0.122 -0.1678 ...
##  $ AASDHPPT       : num  0.0385 -0.0161 -0.0379 -0.0283 -0.1271 ...
##  $ AASS           : num  -0.1499 -0.0922 -0.1223 -0.1059 -0.1744 ...
##  $ AATF           : num  -0.0435 -0.0337 -0.0362 -0.0534 0.2721 ...
##  $ AATK           : num  -0.182 -0.156 -0.126 -0.194 -0.203 ...
##  $ ABAT           : num  0.0109 -0.0936 -0.0411 -0.1361 -0.1642 ...
##  $ ABCA1          : num  -0.0322 -0.0723 -0.111 -0.0737 -0.175 ...
##  $ ABCA10         : num  -0.1532 -0.0807 -0.1276 -0.1277 -0.2049 ...
##  $ ABCA11P        : num  -0.151 -0.12 -0.107 -0.153 -0.182 ...
##  $ ABCA12         : num  -0.1819 -0.1529 -0.0755 -0.1699 -0.2032 ...
##  $ ABCA13         : num  -0.184 -0.157 -0.118 -0.193 -0.206 ...
##  $ ABCA17P        : num  -0.181 -0.157 -0.126 -0.195 -0.205 ...
##  $ ABCA2          : num  -0.1 -0.115 -0.058 -0.132 -0.142 ...
##  $ ABCA3          : num  -0.0964 -0.1074 -0.0657 -0.1389 -0.1053 ...
##  $ ABCA4          : num  -0.182 -0.156 -0.125 -0.185 -0.204 ...
##  $ ABCA5          : num  -0.1461 -0.0786 -0.1251 -0.1418 -0.1819 ...
##  $ ABCA6          : num  -0.1173 -0.0654 -0.125 -0.1239 -0.1995 ...
##  $ ABCA7          : num  -0.1542 -0.1432 -0.0728 -0.1734 -0.1712 ...
##  $ ABCA8          : num  -0.0872 0.0105 -0.1273 -0.0374 -0.2054 ...
##  $ ABCA9          : num  -0.1441 -0.0389 -0.1272 -0.0694 -0.2053 ...
##  $ ABCB1          : num  -0.162 -0.116 -0.127 -0.16 -0.205 ...
##  $ ABCB10         : num  -0.1006 -0.1052 -0.0946 -0.1278 -0.1697 ...
##  $ ABCB11         : num  -0.182 -0.158 -0.127 -0.196 -0.206 ...
##  $ ABCB4          : num  -0.177 -0.157 -0.126 -0.193 -0.202 ...
##  $ ABCB5          : num  -0.173 -0.145 -0.128 -0.153 -0.206 ...
##  $ ABCB6          : num  -0.1527 -0.1228 -0.0975 -0.1511 -0.1664 ...
##  $ ABCB7          : num  -0.0564 -0.0694 -0.0762 -0.0822 -0.1353 ...
##  $ ABCB8          : num  -0.12 -0.125 -0.101 -0.157 -0.106 ...
##  $ ABCB9          : num  -0.173 -0.15 -0.109 -0.181 -0.188 ...
##  $ ABCC1          : num  -0.0939 -0.0924 -0.0829 -0.1133 -0.1205 ...
##  $ ABCC10         : num  -0.162 -0.134 -0.101 -0.17 -0.147 ...
##  $ ABCC11         : num  -0.1631 -0.0973 -0.0151 -0.194 -0.2053 ...
##  $ ABCC12         : num  -0.184 -0.158 -0.126 -0.196 -0.206 ...
##  $ ABCC13         : num  -0.183 -0.158 -0.126 -0.196 -0.206 ...
##  $ ABCC2          : num  -0.182 -0.156 -0.125 -0.195 -0.206 ...
##  $ ABCC3          : num  -0.0978 -0.0596 -0.1077 -0.146 -0.1135 ...
##  $ ABCC4          : num  -0.1542 -0.1407 -0.0259 -0.1611 -0.1969 ...
##  $ ABCC5          : num  -0.0996 -0.0977 0.127 -0.091 -0.0583 ...
##  $ ABCC6          : num  -0.1682 -0.1371 -0.0997 -0.1613 -0.1948 ...
##  $ ABCC6P1        : num  -0.183 -0.155 -0.122 -0.194 -0.206 ...
##  $ ABCC6P2        : num  -0.177 -0.146 -0.117 -0.184 -0.189 ...
##  $ ABCC8          : num  -0.175 -0.157 -0.114 -0.192 -0.202 ...
##  $ ABCC9          : num  -0.1275 -0.1228 -0.1181 -0.0715 -0.1863 ...
##  $ ABCD1          : num  -0.153 -0.1393 -0.0999 -0.1684 -0.1611 ...
##  $ ABCD2          : num  -0.169 -0.142 -0.128 -0.164 -0.206 ...
##  $ ABCD3          : num  0.0838 0.0736 -0.0427 0.0639 -0.0901 ...
##  $ ABCD4          : num  -0.1009 -0.0762 -0.0943 -0.0849 -0.0797 ...
##  $ ABCE1          : num  -0.021129 0.000303 -0.052552 0.022748 -0.067289 ...
##  $ ABCF1          : num  -0.033 -0.0139 0.0228 -0.0322 0.1017 ...
##  $ ABCF2          : num  0.00915 -0.02894 -0.06029 -0.0497 0.00982 ...
##  $ ABCF3          : num  0.0144 -0.0383 0.0265 -0.0648 0.0874 ...
##  $ ABCG1          : num  -0.0215 -0.0861 -0.0807 -0.0719 -0.0356 ...
##  $ ABCG2          : num  -0.164 -0.13 -0.126 -0.184 -0.198 ...
##  $ ABCG4          : num  -0.184 -0.158 -0.127 -0.196 -0.206 ...
##  $ ABCG5          : num  -0.183 -0.158 -0.127 -0.196 -0.206 ...
##  $ ABCG8          : num  -0.184 -0.158 -0.126 -0.196 -0.206 ...
##  $ ABHD1          : num  -0.179 -0.153 -0.116 -0.187 -0.202 ...
##  $ ABHD10         : num  -0.00456 -0.04482 -0.06261 -0.05365 -0.12271 ...
##  $ ABHD11         : num  -0.0632 -0.0456 -0.0206 -0.0499 0.0969 ...
##  $ ABHD12         : num  0.1314 0.000523 0.102147 0.00089 0.154931 ...
##  $ ABHD12B        : num  -0.168 -0.157 -0.116 -0.194 -0.201 ...
##  $ ABHD13         : num  -0.122 -0.109 -0.111 -0.116 -0.168 ...
##  $ ABHD14A        : num  -0.1223 -0.1032 -0.0683 -0.0896 -0.0885 ...
##  $ ABHD14B        : num  0.046 0.182 0.153 0.32 0.414 ...
##  $ ABHD15         : num  -0.1257 -0.1141 0.1043 -0.1114 -0.0946 ...
##  $ ABHD2          : num  0.2467 0.01359 0.00566 -0.00185 2.04067 ...
##  $ ABHD3          : num  -0.0385 -0.0874 -0.0917 -0.1152 -0.1615 ...
##  $ ABHD4          : num  -0.0431 -0.0295 -0.0304 -0.0631 0.0866 ...
##  $ ABHD5          : num  -0.1149 -0.0785 -0.0935 -0.0647 -0.137 ...
##  $ ABHD6          : num  -0.135 -0.114 -0.116 -0.148 -0.175 ...
##  $ ABHD8          : num  -0.154 -0.127 -0.103 -0.17 -0.146 ...
##  $ ABI1           : num  0.0315 0.0563 -0.0259 0.0492 -0.0239 ...
##  $ ABI2           : num  -0.0515 -0.0145 -0.0892 -0.0192 -0.1032 ...
##  $ ABI3           : num  -0.141 -0.137 -0.114 -0.165 -0.167 ...
##  $ ABI3BP         : num  0.0679 0.0489 -0.1253 0.1435 -0.1983 ...
##   [list output truncated]

2.1 Problem II.1

Run PCA for dimension reduction on the 100 samples with known labels, and draw these 100 samples in a 2D plot. Do cancer and normal separate from the first two PCs? Would this be sufficient to classify the unknown samples?

z-score normalized data are provided in BRCA_zscore_data.txt. Phenotype data is in BRCA_phenotype.txt.

## run PCA

# perform PCA on the z-score matrix (samples x genes)
brca_pca_result <- prcomp(zscore_data, center = TRUE, scale. = TRUE)

# combine the PCA scores with the phenotype data
brca_pca_data <- as.data.frame(brca_pca_result$x)
brca_pca_data$phenotype <- phenotype$phenotype

summary(brca_pca_result)
## Importance of components:
##                             PC1      PC2      PC3      PC4      PC5      PC6
## Standard deviation     104.5628 39.41533 33.57338 25.99085 22.71466 20.18413
## Proportion of Variance   0.5333  0.07578  0.05498  0.03295  0.02517  0.01987
## Cumulative Proportion    0.5333  0.60909  0.66407  0.69702  0.72219  0.74206
##                             PC7      PC8      PC9     PC10     PC11     PC12
## Standard deviation     17.89812 17.56338 16.19269 15.06316 14.66006 14.27673
## Proportion of Variance  0.01563  0.01505  0.01279  0.01107  0.01048  0.00994
## Cumulative Proportion   0.75769  0.77273  0.78552  0.79659  0.80707  0.81702
##                            PC13     PC14     PC15     PC16     PC17     PC18
## Standard deviation     13.46337 13.10836 12.52409 12.37768 11.79345 11.62266
## Proportion of Variance  0.00884  0.00838  0.00765  0.00747  0.00678  0.00659
## Cumulative Proportion   0.82586  0.83424  0.84189  0.84936  0.85615  0.86274
##                           PC19     PC20     PC21     PC22    PC23    PC24
## Standard deviation     11.1819 11.11545 10.67820 10.55494 10.0220 9.94502
## Proportion of Variance  0.0061  0.00603  0.00556  0.00543  0.0049 0.00482
## Cumulative Proportion   0.8688  0.87486  0.88043  0.88586  0.8908 0.89558
##                           PC25   PC26    PC27    PC28    PC29    PC30    PC31
## Standard deviation     9.78693 9.4978 9.43866 9.33434 8.95690 8.54872 8.45059
## Proportion of Variance 0.00467 0.0044 0.00435 0.00425 0.00391 0.00356 0.00348
## Cumulative Proportion  0.90026 0.9047 0.90900 0.91325 0.91716 0.92073 0.92421
##                           PC32    PC33    PC34    PC35    PC36    PC37    PC38
## Standard deviation     8.31590 8.19743 8.09015 8.02401 7.80610 7.24526 7.15088
## Proportion of Variance 0.00337 0.00328 0.00319 0.00314 0.00297 0.00256 0.00249
## Cumulative Proportion  0.92759 0.93086 0.93406 0.93720 0.94017 0.94273 0.94522
##                           PC39    PC40    PC41   PC42    PC43    PC44    PC45
## Standard deviation     6.94398 6.85277 6.78629 6.5563 6.54328 6.43537 6.30451
## Proportion of Variance 0.00235 0.00229 0.00225 0.0021 0.00209 0.00202 0.00194
## Cumulative Proportion  0.94758 0.94987 0.95211 0.9542 0.95630 0.95832 0.96026
##                           PC46    PC47    PC48    PC49    PC50    PC51    PC52
## Standard deviation     6.17937 6.01904 5.98582 5.83750 5.79709 5.59774 5.44917
## Proportion of Variance 0.00186 0.00177 0.00175 0.00166 0.00164 0.00153 0.00145
## Cumulative Proportion  0.96212 0.96389 0.96563 0.96730 0.96894 0.97046 0.97191
##                           PC53    PC54   PC55   PC56    PC57   PC58    PC59
## Standard deviation     5.42000 5.32816 5.1551 4.9512 4.82053 4.7392 4.66240
## Proportion of Variance 0.00143 0.00138 0.0013 0.0012 0.00113 0.0011 0.00106
## Cumulative Proportion  0.97335 0.97473 0.9760 0.9772 0.97836 0.9795 0.98051
##                           PC60    PC61    PC62    PC63    PC64    PC65    PC66
## Standard deviation     4.58210 4.50314 4.43727 4.21085 4.15021 4.00623 3.98938
## Proportion of Variance 0.00102 0.00099 0.00096 0.00086 0.00084 0.00078 0.00078
## Cumulative Proportion  0.98154 0.98252 0.98349 0.98435 0.98519 0.98597 0.98675
##                           PC67    PC68    PC69    PC70    PC71   PC72    PC73
## Standard deviation     3.86455 3.82101 3.73578 3.60320 3.54747 3.5115 3.41460
## Proportion of Variance 0.00073 0.00071 0.00068 0.00063 0.00061 0.0006 0.00057
## Cumulative Proportion  0.98748 0.98819 0.98887 0.98950 0.99012 0.9907 0.99129
##                           PC74    PC75    PC76    PC77    PC78    PC79    PC80
## Standard deviation     3.29442 3.22360 3.17530 3.13669 3.05530 2.95778 2.93320
## Proportion of Variance 0.00053 0.00051 0.00049 0.00048 0.00046 0.00043 0.00042
## Cumulative Proportion  0.99182 0.99232 0.99282 0.99330 0.99375 0.99418 0.99460
##                           PC81    PC82    PC83    PC84    PC85    PC86    PC87
## Standard deviation     2.89133 2.83490 2.80187 2.74540 2.66664 2.60667 2.53816
## Proportion of Variance 0.00041 0.00039 0.00038 0.00037 0.00035 0.00033 0.00031
## Cumulative Proportion  0.99501 0.99540 0.99578 0.99615 0.99650 0.99683 0.99714
##                           PC88    PC89    PC90    PC91    PC92    PC93    PC94
## Standard deviation     2.51787 2.41869 2.37368 2.29549 2.27132 2.25409 2.19206
## Proportion of Variance 0.00031 0.00029 0.00027 0.00026 0.00025 0.00025 0.00023
## Cumulative Proportion  0.99745 0.99774 0.99801 0.99827 0.99852 0.99877 0.99900
##                           PC95    PC96    PC97    PC98    PC99     PC100
## Standard deviation     2.15782 2.10125 2.06379 1.96352 1.81384 6.024e-14
## Proportion of Variance 0.00023 0.00022 0.00021 0.00019 0.00016 0.000e+00
## Cumulative Proportion  0.99923 0.99944 0.99965 0.99984 1.00000 1.000e+00

PC1 and PC2 explain 53.33% and 7.58% of the variance, respectively. Of the 100 principal components, the first two together explain 60.91% of the total variance.

## view the first two PCs

# plot the first two principal components
ggplot(brca_pca_data, aes(x = PC1, y = PC2, color = phenotype)) +
    geom_point(size = 3) +
    labs(
        title = "PCA of Breast Tumor Samples",
        x = "Principal Component 1",
        y = "Principal Component 2"
    ) +
    theme_minimal() +
    scale_color_manual(values = c("Normal" = "blue", "Tumor" = "red"))

Here we plot the PCA of the training dataset containing 50 breast tumor samples and 50 normal breast samples. Importantly, we centered and scaled the data before running PCA, because PCA maximizes the variance captured by the selected components. Since genes differ in their biological expression variance, standardizing first ensures that PCA captures components that truly reflect the underlying variance structure of the data. Without standardization, PC1 and PC2 explain only 19.84% and 13.84% of the variance, whereas after standardization they explain 53.33% and 7.58%, respectively. Standardization therefore prevents a few high-variance genes from dominating the first principal component.

From the PCA plot colored by phenotype above, we can clearly see that cancer and normal tissues separate on the first two principal components, since the samples of each color largely cluster together. More specifically, the separation is driven mainly by PC2: roughly speaking, normal samples tend to have PC2 values below 0, while tumor samples mostly have PC2 values above 0. Although we do see separation on the first two PCs, this alone is not sufficient to classify the unknown samples, because PCA is only a dimension-reduction tool. Since it does not use any information from the sample labels, it is an unsupervised method and cannot by itself make classification predictions. To use these PCA results to predict cancer versus normal, we first need a supervised step: fit a classifier on the PCA-transformed data together with the class labels. PCA on its own therefore cannot serve as a classifier.
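To make the supervised step concrete, here is a small self-contained sketch using simulated, hypothetical data: run PCA, then fit a logistic regression on the PC scores plus the labels. This only illustrates the workflow, not the homework's actual model.

```r
# PCA alone cannot classify; a supervised model fit on the PC scores can.
set.seed(1)
# simulate two well-separated groups in 2 dimensions (hypothetical data)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 3), ncol = 2))
y <- factor(rep(c("Normal", "Tumor"), each = 20))
# unsupervised step: PCA scores
pcs <- as.data.frame(prcomp(x, center = TRUE, scale. = TRUE)$x)
# supervised step: logistic regression on the PC scores + labels
fit <- glm(y ~ PC1 + PC2, data = cbind(pcs, y = y), family = binomial)
pred <- ifelse(predict(fit, type = "response") > 0.5, "Tumor", "Normal")
mean(pred == y) # training accuracy on this easy simulated data
```

On well-separated toy data like this, the fitted classifier recovers essentially all labels; on real data the model would of course be evaluated with cross-validation instead.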

2.2 Problem II.2

Draw a plot showing the cumulative % variance captured from the top 100 PCs. How many PCs are needed to capture 90% of the variance?

# proportion of variance explained by each PC
explained_variance <- brca_pca_result$sdev^2 / sum(brca_pca_result$sdev^2)

# cumulative proportion of variance explained
cumulative_variance <- cumsum(explained_variance)

# number of PCs needed to capture 90% of the variance
num_components_90 <- which(cumulative_variance >= 0.90)[1]

# data frame for plotting the cumulative variance of the first 100 PCs
variance_data <- data.frame(
    PC = 1:100,
    CumulativeVariance = cumulative_variance[1:100]
)

# plot the cumulative explained variance
ggplot(variance_data, aes(x = PC, y = CumulativeVariance)) +
    geom_line(color = "blue") +
    geom_point() +
    geom_hline(yintercept = 0.90, linetype = "dashed", color = "red") +
    geom_vline(xintercept = num_components_90, linetype = "dashed", color = "green") +
    labs(
        title = "Cumulative Explained Variance by Principal Components",
        x = "Number of Principal Components",
        y = "Cumulative Explained Variance"
    ) +
    annotate(
        "text",
        x = num_components_90,
        y = 0.5,
        label = paste("PC", num_components_90),
        color = "green",
        vjust = -1
    ) +
    theme_minimal()

# How many PCs are needed to capture 90% of the variance?
print(num_components_90)
## [1] 25

In the plot above, the dashed red line marks the 90% variance threshold. Based on the cumulative sums, 25 principal components (PCs) are needed to capture 90% of the variance.

This means that using only these 25 PCs captures 90% of the variation in the data.
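The selection rule used above (cumsum plus which) can be seen on a toy variance vector; the numbers here are hypothetical and chosen only for illustration:

```r
# toy per-PC variances (hypothetical eigenvalues)
vars <- c(8, 4, 2, 2)
prop <- vars / sum(vars)      # 0.500 0.250 0.125 0.125
cum  <- cumsum(prop)          # 0.500 0.750 0.875 1.000
k90  <- which(cum >= 0.90)[1] # index of the first PC reaching 90%
k90                           # 4
```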

2.3 Problem II.3

Apply machine learning methods (KNN, logistic regression, Ridge regression, LASSO, ElasticNet, random forest, and support vector machines) on the top 25 PCs of the training data and 5-fold cross validation to classify the samples. caret and MASS already implemented all of the machine learning methods, including cross-validation, so calling each is only one command. In order to get consistent results from different runs, use set.seed().

Five-fold cross-validation is a technique for estimating a machine learning model's performance. The dataset is split into 5 equally sized subsets (folds); in each iteration, 4 folds are used as the training set and the remaining fold as the validation set. This is repeated 5 times, each time holding out a different fold, and the 5 validation results are averaged to give the model's overall performance.
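The fold bookkeeping described above can be sketched in a few lines of base R; this is a minimal illustration of the idea, not caret's actual internal implementation:

```r
# minimal sketch of 5-fold cross-validation index bookkeeping (base R)
set.seed(42)
n <- 100
k <- 5
# randomly assign each of the n samples to one of k equally sized folds
folds <- sample(rep(1:k, length.out = n))
for (i in 1:k) {
    val_idx   <- which(folds == i) # 20 samples held out for validation
    train_idx <- which(folds != i) # 80 samples used for training
    # fit the model on train_idx, evaluate on val_idx,
    # then average the k validation scores at the end
}
```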

# fix the random seed for reproducibility
set.seed(2147483645)
brca_pca_data25 <- data.frame(brca_pca_result$x[, 1:25])
brca_pca_data25$phenotype <- phenotype$phenotype
rownames(brca_pca_data25) <- rownames(brca_pca_result$x)

#### Run models using 5-fold cross validation

models <- list()

# a) basic algorithms

# define the 5-fold cross-validation control
train_control <- trainControl(
    method = "cv",
    number = 5,
    classProbs = TRUE, # enable class probabilities
    summaryFunction = twoClassSummary # evaluate with ROC
)

# KNN model
models[["KNN"]] <- train(
    phenotype ~ .,
    data = brca_pca_data25,
    method = "knn",
    trControl = train_control,
    tuneGrid = expand.grid(
        k = seq(3, 15, by = 2)
    )
)

# logistic regression model
models[["Logistic"]] <- train(
    phenotype ~ ., # formula: predict the target from all features
    data = brca_pca_data25, # dataset
    metric = "ROC", # evaluation metric
    method = "glmnet", # penalized logistic regression via glmnet
    family = "binomial", # binomial family = logistic regression
    trControl = train_control # cross-validation control
)

# ridge regression model
models[["Ridge"]] <- train(
    phenotype ~ .,
    data = brca_pca_data25,
    metric = "ROC",
    method = "glmnet",
    trControl = train_control,
    tuneGrid = expand.grid(
        alpha = 0, # alpha = 0 gives the L2 (ridge) penalty
        lambda = seq(0.001, 1, length = 20)
    )
)

# LASSO model
models[["LASSO"]] <- train(
    phenotype ~ .,
    data = brca_pca_data25,
    metric = "ROC",
    method = "glmnet",
    trControl = train_control,
    tuneGrid = expand.grid(
        alpha = 1, # alpha = 1 gives the L1 (LASSO) penalty
        lambda = seq(0.001, 1, length = 20)
    )
)

# elastic net model (mixes the L1 and L2 penalties)
models[["ElasticNet"]] <- train(
    phenotype ~ .,
    data = brca_pca_data25,
    metric = "ROC",
    method = "glmnet",
    trControl = train_control,
    tuneGrid = expand.grid(
        alpha = seq(0, 1, length = 10),
        lambda = seq(0.001, 1, length = 20)
    )
)

# b) advanced algorithms

# random forest model
models[["RandomForest"]] <- train(
    phenotype ~ .,
    data = brca_pca_data25,
    metric = "ROC",
    method = "rf",
    trControl = train_control
)

# support vector machine model
models[["SvmRadial"]] <- train(
    phenotype ~ .,
    data = brca_pca_data25,
    metric = "ROC",
    method = "svmRadial", # radial basis function kernel
    trControl = train_control,
    tuneGrid = expand.grid(
        C = 2^(-5:5),
        sigma = c(0.01, 0.1, 0.5, 1, 2, 5)
    )
)
## maximum number of iterations reached 0.001717356 -0.001687787 (kernlab convergence warning, repeated many times while tuning C and sigma)
models[["SvmLinear"]] <- train(
    phenotype ~ .,
    data = brca_pca_data25,
    metric = "ROC",
    method = "svmLinear",
    trControl = train_control
)

2.4 Problem II.4

Summarize the performance of each machine learning method, in terms of accuracy and kappa.

总结每种机器学习方法的表现,以准确度和kappa指数为指标。

# compare the models, using accuracy and the kappa statistic to measure performance

performance_summary <- data.frame()

for (model in names(models)) {
    # predict on the training data (resubstitution, so these scores are
    # optimistic compared with the cross-validated estimates)
    predictions <- predict(models[[model]], newdata = brca_pca_data25)
    # build the confusion matrix
    conf_matrix <- confusionMatrix(factor(predictions), factor(brca_pca_data25$phenotype))
    # extract accuracy
    accuracy <- conf_matrix$overall["Accuracy"]
    # extract the kappa statistic
    kappa <- conf_matrix$overall["Kappa"]
    # append the results to the performance summary data frame
    performance_summary <- rbind(
        performance_summary,
        data.frame(
            Model = model,
            Accuracy = accuracy,
            Kappa = kappa
        )
    )
}

print(performance_summary)
##                  Model Accuracy     Kappa
## Accuracy           KNN     0.94 0.8801438
## Accuracy1     Logistic     1.00 1.0000000
## Accuracy2        Ridge     0.92 0.8398078
## Accuracy3        LASSO     0.95 0.8999199
## Accuracy4   ElasticNet     0.97 0.9400000
## Accuracy5 RandomForest     1.00 1.0000000
## Accuracy6    SvmRadial     0.99 0.9799840
## Accuracy7    SvmLinear     1.00 1.0000000
# Scatter plot of accuracy vs. kappa
ggplot(performance_summary, aes(x = Accuracy, y = Kappa, label = Model)) +
    geom_point(size = 4, color = "blue") +
    geom_text(
        vjust = -0.5,
        size = 4,
        nudge_y = ifelse(performance_summary$Model == "RandomForest", 0.006,
            ifelse(performance_summary$Model == "Logistic", 0.010, 0.002)
        )
    ) +
    labs(
        title = "Model Performance: Accuracy vs. Kappa",
        x = "Accuracy",
        y = "Kappa"
    ) +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5))

The table and the scatter plot above summarize the accuracy and the kappa statistic for every machine learning method used, from KNN to SVM.
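As a reminder of how the two metrics relate, Cohen's kappa normalizes the observed accuracy against the agreement expected by chance from the confusion-matrix margins. A minimal base-R sketch on toy counts (not our actual predictions):

```r
# Toy 2x2 confusion matrix: rows = predicted class, columns = observed class
cm <- matrix(c(45,  5,
                3, 47), nrow = 2, byrow = TRUE)
n <- sum(cm)
observed_acc <- sum(diag(cm)) / n                      # plain accuracy: 0.92
# Chance agreement implied by the row/column margins
expected_acc <- sum(rowSums(cm) * colSums(cm)) / n^2   # 0.5 here
kappa <- (observed_acc - expected_acc) / (1 - expected_acc)
kappa                                                  # 0.84
```

This is why a model can post a high accuracy but a noticeably lower kappa when one class dominates: the denominator discounts whatever agreement chance alone would produce.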

# Heatmap of the hyperparameter tuning results
ggplot(models[["ElasticNet"]]$results, aes(x = alpha, y = lambda, fill = ROC)) +
    geom_tile() +
    scale_fill_gradient2(low = "white", midpoint = 0.85, name = "ROC") +
    labs(
        title = "Elastic Net Hyperparameter Tuning",
        x = "Alpha",
        y = "Lambda"
    ) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

I also plotted a heatmap of the ElasticNet model's hyperparameter tuning results, showing the ROC value for each combination of alpha and lambda. The ROC curve is an evaluation tool for binary classifiers that captures the trade-off between the true positive rate (TPR) and the false positive rate (FPR). In this plot we can see how different combinations of alpha and lambda affect model performance.
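For reference, the two hyperparameters on the heatmap's axes enter the model through the elastic net penalty, lambda * ((1 - alpha)/2 * ||b||^2 + alpha * ||b||_1), which reduces to ridge at alpha = 0 and to LASSO at alpha = 1. A small sketch of the penalty itself (toy coefficients, for illustration only):

```r
# Elastic net penalty as parameterized by glmnet:
# lambda * ((1 - alpha)/2 * sum(beta^2) + alpha * sum(abs(beta)))
enet_penalty <- function(beta, alpha, lambda) {
    lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

beta <- c(0.5, -0.2, 0)
enet_penalty(beta, alpha = 0, lambda = 0.1)  # pure ridge:  0.1 * 0.5 * 0.29 = 0.0145
enet_penalty(beta, alpha = 1, lambda = 0.1)  # pure lasso:  0.1 * 0.7        = 0.07
```

Tuning alpha slides the penalty continuously between these two extremes, which is what the heatmap is exploring.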

2.5 Problem II.5: For Graduate students only

Compare the performance difference between logistic regression, Ridge, LASSO, and ElasticNet. In LASSO, how many PCs have non-zero coefficient? What is the lambda for Ridge and LASSO, respectively?

比较逻辑回归、Ridge、LASSO和ElasticNet之间的表现差异。在LASSO中,有多少个主成分(PC)具有非零系数?Ridge和LASSO的lambda值分别是多少?

# Extract the best lambda for LASSO and Ridge, and count the non-zero
# LASSO coefficients

# Best lambda of the LASSO model
best_lambda_lasso <- models[["LASSO"]]$bestTune$lambda

# Number of non-zero coefficients in the LASSO model
# (note: this count includes the intercept)
nonzero_coeff_lasso <- coef(models[["LASSO"]]$finalModel, s = best_lambda_lasso)
nonzero_pc_lasso <- sum(nonzero_coeff_lasso != 0)

cat("Number of non-zero coefficients in the LASSO model:", nonzero_pc_lasso, "\n")
## Number of non-zero coefficients in the LASSO model: 6
# Best lambda for LASSO
cat("Best lambda for LASSO:", best_lambda_lasso, "\n")
## Best lambda for LASSO: 0.05357895
# Best lambda for Ridge
best_lambda_ridge <- models[["Ridge"]]$bestTune$lambda
cat("Best lambda for Ridge:", best_lambda_ridge, "\n")
## Best lambda for Ridge: 0.1061579

Accuracy is the percentage of correctly classified instances. Kappa is an accuracy-based metric that is normalized to account for how often each class would be predicted by chance. Comparing logistic regression, Ridge, LASSO, and ElasticNet specifically on these two metrics, ElasticNet appears to outperform the other models. The result is most apparent against Ridge and LASSO: ElasticNet achieved a mean accuracy of 0.970, versus 0.920 for Ridge and 0.950 for LASSO, with mean kappa values of 0.94, 0.84, and 0.90, respectively. ElasticNet's superior performance is consistent with its model structure: all three methods minimize the same loss (for this classification task, the binomial deviance rather than a residual sum of squares) plus a penalty term with two tunable parameters, alpha and lambda. We tuned both alpha and lambda to produce the most accurate ElasticNet model. In contrast, Ridge regression fixes alpha at 0 and LASSO fixes alpha at 1, leaving only lambda to tune. Since ElasticNet can optimize two parameters while Ridge and LASSO can optimize only one, ElasticNet outperforming Ridge and LASSO is to be expected.

In logistic regression we minimize the same loss but apply no penalty term at all. Judging by the accuracy and kappa statistics, logistic regression appears to outperform Ridge and LASSO. A penalty term is normally used to introduce some bias in order to reduce variance and avoid overfitting the training data. Here, our data exhibit perfect separation, which prevents the unpenalized model from converging, hence the warning messages. That same perfect separation, however, means the model can fit the training data effortlessly and thus shows particularly strong performance. It is therefore not surprising that the penalties introduced by LASSO and Ridge reduce accuracy and kappa relative to the perfectly separating logistic regression model.

The LASSO model reports 6 non-zero coefficients, but one of them is the intercept, so 5 principal components have non-zero coefficients in LASSO. The best lambda is 0.106 for Ridge (alpha = 0) and 0.0536 for LASSO (alpha = 1).
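The reason LASSO produces exact zeros while Ridge only shrinks can be seen in the orthonormal-design special case, where LASSO applies soft-thresholding to each OLS coefficient and Ridge merely rescales it. A base-R sketch using our fitted lambda values on toy coefficients:

```r
# Soft-thresholding (LASSO in the orthonormal case): coefficients with
# |b| <= lambda are set exactly to zero.
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)
# Ridge in the orthonormal case: every coefficient is rescaled, none vanish.
ridge_shrink <- function(b, lambda) b / (1 + lambda)

b_ols <- c(2.0, 0.04, -0.8)                  # toy OLS coefficients
soft_threshold(b_ols, lambda = 0.0536)       # the 0.04 entry becomes exactly 0
ridge_shrink(b_ols, lambda = 0.1062)         # all entries shrink but stay non-zero
```

This is why the LASSO fit selects a sparse set of 5 PCs, whereas the Ridge fit keeps every PC with a (shrunken) non-zero coefficient.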

2.6 Problem II.6

Use the PCA projections in Q1 to obtain the first 25 PCs of the 20 unknown samples. Use one method that performs well in Q4 to make predictions. Caret already used the hyper-parameters learned from cross-validation to train the parameters of each method on the full 100 training data. You just need to call this method to make the predictions.

Expression data for the 20 unknown samples are provided in unknown_samples.txt.

使用问题1中的PCA投影来获取20个未知样本的前25个主成分(PC)。选择在问题4中表现良好的一种方法进行预测。Caret已经使用交叉验证中学到的超参数在全部100个训练数据上训练了每种方法的参数。你只需调用此方法进行预测即可。

20个未知样本的表达数据已提供在文件unknown_samples.txt中。

# Use the PCA projections in Q1 to obtain the first 25 PCs

# Project unknown_samples onto the first 25 principal components
test_pca_data <- data.frame(predict(brca_pca_result, newdata = unknown_samples)[, 1:25])

# Predict on the test data using the linear-kernel SVM model
svm_predictions <- predict(models[["SvmLinear"]], newdata = test_pca_data)

# Print the predictions
print("Predictions for the 20 unknown samples:")
## [1] "Predictions for the 20 unknown samples:"
print(svm_predictions)
##  [1] Tumor  Normal Tumor  Normal Tumor  Normal Tumor  Normal Tumor  Normal
## [11] Tumor  Normal Tumor  Normal Tumor  Normal Tumor  Normal Tumor  Tumor 
## Levels: Normal Tumor

The predictions for the 20 test samples were determined by our trained SVM model after projecting the test data onto the eigenvectors of the PCs determined for the training data.

通过将测试数据投影到为训练数据确定的主成分(PC)的特征向量上,我们的训练好的SVM模型确定了20个测试样本的预测结果。
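This projection step is just centering the new samples with the *training* means and multiplying by the training rotation matrix, which is exactly what `stats::predict.prcomp` does. A self-contained sketch on toy data (not our expression matrix) verifying the equivalence:

```r
# Toy data: 50 training samples and 5 new samples over 10 features
set.seed(1)
train <- matrix(rnorm(50 * 10), nrow = 50)
test  <- matrix(rnorm(5 * 10), nrow = 5)

pca <- prcomp(train, center = TRUE, scale. = FALSE)

# Manual projection: subtract training-set column means, apply rotation
manual <- sweep(test, 2, pca$center) %*% pca$rotation
auto   <- predict(pca, newdata = test)

all.equal(manual, auto, check.attributes = FALSE)  # TRUE
```

The key point is that the test samples are centered with the training means, never their own, so the unknown samples land in the same PC coordinate system the classifier was trained in.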

2.7 Problem II.7: For Graduate students only

Can you find out the top 3 genes that are most important in this prediction method in Q6? Do they have some known cancer relevance?

你能找到在问题6中该预测方法中最重要的前三个基因吗?它们是否与某些已知的癌症相关?

# Gene importance: total absolute loading of each gene across the first 25 PCs
gene_importance <- rowSums(abs(brca_pca_result$rotation[, 1:25]))
# The three genes with the largest total absolute loadings
important_genes <- names(sort(gene_importance, decreasing = TRUE))[1:3]

print(important_genes)
## [1] "C20orf199" "DEXI"      "DDX27"

I am not sure how best to answer this one. Summing absolute loadings over the first 25 PCs treats every PC as equally important to the classifier, which is only a rough approximation; weighting each PC's loadings by how strongly the model actually uses that PC would likely be more faithful.
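One possible refinement, sketched here on toy numbers rather than our data: weight each PC's gene loadings by the magnitude of the classifier's weight on that PC (the `pc_weights` vector below is hypothetical; in practice it could come from, e.g., the absolute model coefficients per PC).

```r
# Toy example: 4 genes x 3 PCs of loadings
rotation <- matrix(c(0.9, 0.1, 0.0,
                     0.1, 0.8, 0.1,
                     0.0, 0.1, 0.9,
                     0.3, 0.3, 0.3),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(paste0("gene", 1:4), paste0("PC", 1:3)))

# Hypothetical per-PC weights from the classifier (PC1 matters most here)
pc_weights <- c(1.5, 0.1, 0.05)

# Importance: loadings weighted by how much the model uses each PC
importance <- abs(rotation) %*% pc_weights
rownames(importance)[order(importance, decreasing = TRUE)]
```

With uniform `pc_weights` this reduces to the `rowSums(abs(...))` approach used above; unequal weights instead emphasize the genes driving the PCs the model actually relies on.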

2.8 Problem II.8

Suppose a pathologist later made diagnosis on the 20 unknown samples (load the diagnosis.txt file). Based on this gold standard, draw an ROC curve of your predictions in Q6. What is the prediction AUC?

假设病理学家后来对这20个未知样本做出了诊断(加载diagnosis.txt文件)。基于这个“金标准”,绘制你在问题6中的预测的ROC曲线。预测的AUC是多少?

# Load the diagnosis file (already loaded earlier)
# diagnosis <- read.table("/home/huzongyao1/HW/HW2/raw_data2/diagnosis.txt", header = TRUE)

## ROC

svm_predictions_prob <- predict(models[["SvmLinear"]], newdata = test_pca_data, type = "prob")
svm_tumor_roc <- roc(diagnosis$phenotype, svm_predictions_prob$Tumor)

plot(svm_tumor_roc, col = "blue", main = "ROC Curve for SVMLinear Model")

cat("AUC:", auc(svm_tumor_roc), "\n")
## AUC: 0.8989899

The prediction AUC is 0.899, which indicates our classifier is quite good. An AUC of 0.899 means that a randomly chosen tumor sample receives a higher predicted tumor probability than a randomly chosen normal sample about 89.9% of the time; not perfect (AUC = 1), but far better than random guessing (AUC = 0.5).
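That ranking interpretation of the AUC can be checked directly: it equals the fraction of (tumor, normal) pairs in which the tumor sample scores higher (ties counted as 1/2), i.e. the normalized Mann-Whitney U statistic. A base-R sketch on toy scores, not our SVM probabilities:

```r
# Toy predicted tumor probabilities for 4 tumor and 4 normal samples
scores_tumor  <- c(0.9, 0.8, 0.7, 0.4)
scores_normal <- c(0.6, 0.5, 0.3, 0.2)

# All 16 (tumor, normal) pairs; score 1 per correctly ranked pair, 0.5 per tie
pairs <- expand.grid(t = scores_tumor, n = scores_normal)
auc_rank <- mean(ifelse(pairs$t > pairs$n, 1,
                 ifelse(pairs$t == pairs$n, 0.5, 0)))
auc_rank  # 14 of 16 pairs ranked correctly -> 0.875
```

Computing the same quantity with `pROC::roc` on these toy scores would give the identical value, which is why the AUC is read as a probability of correct pairwise ranking rather than a classification rate.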